{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Kernel Density Estimation\n", "\n", "This notebook is an introduction to the practical usage of KDEs in zfit and explains the different parameters.\n", "*A complete introduction to kernel density estimation, explanations of all methods implemented in zfit and a thorough\n", "comparison of their performance can be found either in\n", "[Performance of univariate kernel density estimation methods in TensorFlow](https://astroviking.github.io/ba-thesis/)\n", "by Marc Steiner, from which parts here are taken, or in the [documentation of zfit](https://zfit.readthedocs.io/en/latest/).*\n", "\n", "\n", "Kernel density estimation is a non-parametric method to estimate the density of a population and offers a more accurate estimate than a\n", "histogram.\n", "In a kernel density estimation, each data point is substituted with a kernel function\n", "that specifies how much it influences its neighboring regions. These kernel functions can then be summed up to obtain an\n", "estimate of the probability density, quite similar to summing up data points inside bins.\n", "\n", "However, since\n", "the kernel functions are centered on the data points directly, KDE circumvents the problem of arbitrary bin positioning.\n", "KDE still depends on the kernel bandwidth (a measure of the spread of the kernel function); however, the total PDF\n", "depends less strongly on the kernel bandwidth than a histogram does on its bin width, and it is much easier to specify\n", "rules for an approximately optimal kernel bandwidth than it is to do so for bin width.\n", "\n", "## Definition\n", "\n", "Given a set of $n$ sample points $x_k$ ($k = 1,\\cdots,n$), the kernel density estimate $\\widehat{f}_h(x)$ is defined as\n", "\n", "\\begin{equation}\n", "\\widehat{f}_h(x) = \\frac{1}{nh} \\sum_{k=1}^n K\\Big(\\frac{x-x_k}{h}\\Big)\n", "\\end{equation}\n", "\n", "where $K(x)$ is called the kernel function, $h$ is the bandwidth of the kernel and $x$ is the value for which the estimate is calculated. The kernel function defines the shape of the influence that a single data point has on the estimate, whereas the bandwidth defines its range. Most typically, a simple Gaussian distribution ($K(x) :=\\frac{1}{\\sqrt{2\\pi}}e^{-\\frac{1}{2}x^2}$) is used as the kernel function.\n", "The larger the bandwidth parameter $h$, the larger the range of influence of a single data point on the estimated distribution.\n", "\n", "### Weights\n", "\n", "It is straightforward to add event weights to the above expression. The KDEs in zfit fully support weighted data samples; the weights are taken into account automatically.\n", "\n", "## Computational complexity\n", "\n", "For large $n$, however, this leads to computational problems, as the computational complexity of the exact KDE above is $\\mathcal{O}(nm)$, where $n$ is the number of sample points to estimate from and $m$ is the number of evaluation points (the points where you want to calculate the estimate).\n", "\n", "To circumvent this problem, several approximate methods exist that decrease this complexity and therefore the runtime as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hint on running the notebook**\n", "\n", "Feel free to rerun a cell a few times. This will change the sample drawn and gives an impression of _what the PDF based on such a sample could also look like_." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import mplhep\n", "import numpy as np\n", "import zfit\n", "\n", "np.random.seed(23)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exact KDE\n", "\n", "The definition of a KDE above is implemented in `KDE1DimExact`. We can start out with a simple Gaussian."
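Before relying on `KDE1DimExact`, it can be instructive to write the definition out by hand. The following is a minimal NumPy sketch of the formula above, independent of zfit; the function name `gaussian_kde` and the bandwidth value `h=0.8` are illustrative choices, not part of the zfit API:

```python
import numpy as np

np.random.seed(23)

def gaussian_kde(grid, sample, h):
    # f_h(x) = 1/(n*h) * sum_k K((x - x_k) / h) with a Gaussian kernel K
    u = (grid[:, None] - sample[None, :]) / h
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(sample) * h)

sample = np.random.normal(loc=0.0, scale=2.0, size=60)
grid = np.linspace(-10, 10, 200)
estimate = gaussian_kde(grid, sample, h=0.8)

# the estimate is a proper density: its integral over the grid is close to 1
print(np.sum(estimate) * (grid[1] - grid[0]))
```

Each sample point contributes one Gaussian bump of width `h`; summing the bumps and dividing by `n * h` yields a normalized density, exactly as in the definition.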
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "obs_wide = zfit.Space('x', (-10, 10))\n", "x = np.linspace(-10, 10, 200)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gauss = zfit.pdf.Gauss(obs=obs_wide, mu=0, sigma=2)\n", "sample = gauss.sample(60)\n", "sample_np = zfit.run(sample.value())\n", "\n", "kde = zfit.pdf.KDE1DimExact(sample,\n", " # obs,\n", " # kernel,\n", " # padding,\n", " # weights,\n", " # name\n", " )\n", "plt.plot(x, kde.pdf(x), label='Default exact KDE')\n", "plt.plot(x, gauss.pdf(x), label='True PDF')\n", "plt.plot(sample_np, np.zeros_like(sample_np), 'b|', ms=12)\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This already looks reasonable, and we see that the PDF is overestimated in regions where, by chance, we sampled many events and underestimated in other regions. Since this was a simple example, let's create a more complicated one (and let's use a few more samples in order to be able to infer the shape)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gauss1 = zfit.pdf.Gauss(obs=obs_wide, mu=0, sigma=2)\n", "gauss2 = zfit.pdf.Gauss(obs=obs_wide, mu=3, sigma=0.5)\n", "true_pdf = zfit.pdf.SumPDF([gauss1, gauss2], fracs=0.85)\n", "sample = true_pdf.sample(1000)\n", "sample_np = zfit.run(sample.value())\n", "\n", "kde = zfit.pdf.KDE1DimExact(sample,\n", " # obs,\n", " # kernel,\n", " # padding,\n", " # weights,\n", " # name\n", " )\n", "plt.plot(x, kde.pdf(x), label='Default exact KDE')\n", "plt.plot(x, true_pdf.pdf(x), label='True PDF')\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Approximating the actual PDF well is now more difficult, in fact impossible with the current configuration, because by default a single, global bandwidth is used."
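To see why a single global bandwidth cannot serve both the wide bump and the narrow peak, here is a NumPy sketch (independent of zfit; the mixture fractions mirror `true_pdf` above, and the two bandwidth values are arbitrary illustrative picks) that evaluates the same estimator with a small and a large bandwidth:

```python
import numpy as np

np.random.seed(23)

def gaussian_kde(grid, sample, h):
    # exact KDE with a Gaussian kernel and a single global bandwidth h
    u = (grid[:, None] - sample[None, :]) / h
    return (np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)).sum(axis=1) / (len(sample) * h)

# 85% wide Gaussian + 15% narrow Gaussian, mimicking true_pdf above
sample = np.concatenate([
    np.random.normal(0.0, 2.0, size=850),   # wide component
    np.random.normal(3.0, 0.5, size=150),   # narrow peak at x = 3
])
grid = np.linspace(-10, 10, 200)

f_small = gaussian_kde(grid, sample, h=0.2)  # noisy in the wide bump
f_large = gaussian_kde(grid, sample, h=1.5)  # washes out the narrow peak

# the narrow peak is much more pronounced with the small bandwidth
i_peak = np.argmin(np.abs(grid - 3.0))
print(f_small[i_peak], f_large[i_peak])
```

With `h=0.2` the peak at $x = 3$ survives but the wide component looks wiggly; with `h=1.5` the wide component is smooth but the peak is smeared away. Local (per-point) bandwidths, as discussed in the bandwidth section, address exactly this trade-off.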
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bandwidth\n", "\n", "The bandwidth of a kernel defines its width, corresponding to the `sigma` of a Gaussian distribution. There is a distinction between global and local bandwidth:\n", "\n", "